Skip to content

Conversation

@zhangbutao
Copy link
Contributor

@zhangbutao zhangbutao commented Sep 19, 2025

What changes were proposed in this pull request?

This PR first implements the functionality for clients to switch catalogs, such as SET CATALOG testcat;. Subsequently, created databases or retrieved database lists will all reside under the testcat catalog.
The primary changes in this PR involve modifying database-related DDL interfaces to ensure that operations like CREATE/ALTER/DROP DATABASE respect the client's current catalog.
You can check the q file catalog_database.q to see some syntax about cat.db. Such as:

-- CREATE DATABASE in new catalog testcat by catalog.db pattern
CREATE DATABASE testcat.testdb_1;
-- Switch database by catalog.db pattern
USE testcat.testdb_1;

-- Drop database by catalog.db pattern
DROP DATABASE testcat.testdb_1;

This PR added some TODO catalog labels, as some catalog tasks need to be fixed in other tickets. You can search TODO catalog to find which task should be done later.

Additionally, this PR only addresses the interfaces for switching catalogs and databases, and does not cover interfaces related to cat.db.table. Therefore, even if the catalog and database are switched, the current client will still query tables from the default hive catalog. The code related to cat.db.tbl (part of the work in HIVE-29177) will also involve many table-related interfaces, which presents certain refactoring challenges.

Why are the changes needed?

We need to make database work with catalog to support multiple catalogs in client simultaneously.

Does this PR introduce any user-facing change?

No. Users still can use classical syntax such as use dbname.

How was this patch tested?

mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile='catalog_database.q' -Dtest.output.overwrite=true -pl itests/qtest -Pitests

@zhangbutao zhangbutao marked this pull request as ready for review October 21, 2025 08:29
@sonarqubecloud
Copy link

RESOURCE_PLAN_NOT_EXISTS(10418, "Resource plan {0} does not exist", true),
INCOMPATIBLE_STRUCT(10419, "Incompatible structs.", true),
OBJECTNAME_CONTAINS_DOT(10420, "Table or database name may not contain dot(.) character", true),
OBJECTNAME_CONTAINS_DOT(10420, "Catalog or table or database name may not contain dot(.) character", true),
Copy link
Member

@deniskuzZ deniskuzZ Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use comma: Catalog, database or table names


private final String catalogName;

public SwitchCatalogDesc(String databaseName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

catalogName ?

public CreateDatabaseDesc(String catalogName, String databaseName, String comment, String locationUri, String managedLocationUri,
boolean ifNotExists, Map<String, String> dbProperties) {
this(databaseName, comment, locationUri, managedLocationUri, ifNotExists, dbProperties, "NATIVE", null, null);
this(catalogName, databaseName, comment, locationUri, managedLocationUri, ifNotExists, dbProperties, "NATIVE", null, null);
Copy link
Member

@deniskuzZ deniskuzZ Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the external catalogs dbtype should be "REMOTE", right? It seems that we always hardcode "NATIVE" and below condition never returns true

 if (dbtype != null && dbtype.equalsIgnoreCase("REMOTE")) {

Copy link
Contributor Author

@zhangbutao zhangbutao Oct 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the external catalogs dbtype should be "REMOTE", right?

I don't think so. REMOTE dbtyp was introduced by HIVE-24396 data connector. If we implement all tasks about federated catalog, we can deprecate data connector. Because the data connector can only map external databases to internal databases by manually creating remote databases one by one to query external data sources, like this:
CREATE REMOTE DATABASE db_map_mysql USING mysqlconnector with DBPROPERTIES("connector.remoteDbName"="testmysqldb");

whereas the federated catalog can retrieve all external databases at once without the need to manually create remote databases, once we create a catalog.

It seems that we always hardcode "NATIVE" and below condition never returns true

If you create a remote database with a specified dataconnector, this condition will return true.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we introduce a flag like native/non-native similar to tables?

throw new SemanticException(ErrorMsg.CATALOG_NOT_EXISTS, catalogName);
}
String databaseName = catDbNamePair.getRight();
Database database = getDatabase(catDbNamePair.getLeft(), catDbNamePair.getRight(), !ifExists);
Copy link
Member

@deniskuzZ deniskuzZ Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use local vars?

getDatabase(catalogName, databaseName, !ifExists) 

// Unlock database operation is to release the lock explicitly, the operation itself don't need to be locked.
// Set the WriteEntity as WriteType: DDL_NO_LOCK here, otherwise it will conflict with Hive's transaction.
outputs.add(new WriteEntity(getDatabase(databaseName), WriteType.DDL_NO_LOCK));
outputs.add(new WriteEntity(getDatabase(catalogName, databaseName, true), WriteType.DDL_NO_LOCK));
Copy link
Member

@deniskuzZ deniskuzZ Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HIVE_LOCKS table doesn't have catalog column.
Maybe some operation like Unlock/Lock DB, ShowDbLocks should only be supported by NATIVE DBs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HIVE_LOCKS table doesn't have catalog column.

Currently, this only retrieves a specific database under a particular catalog and does not involve any lock/unlock operations on the catalog.
However, I believe future development should consider implementing lock/unlock operations for catalogs, though I haven't yet identified the specific use cases. It might be necessary during cross-catalog federated queries.

Maybe some operation like Unlock/Lock DB, ShowDbLocks should only be supported by NATIVE DBs?

To be honest, I haven't fully thought this through. Locking/unlocking databases in external catalogs (such as a MySQL JDBC catalog) doesn't seem very meaningful, as these lock operations won't take effect on the MySQL source side. However, I believe allowing locks on databases in external catalogs in Hive might have some significance, particularly in scenarios involving cross-catalog and cross-database operations, to ensure task transactional consistency. Therefore, let's further explore this issue when we work on cross-catalog queries. An already created ticket, HIVE-29242, can be used to track this matter.

throw new HiveException("New lock format only supported with db lock manager.");
}

// TODO catalog. Need to add catalog into ShowLocksRequest. But ShowLocksRequest doesn't have catalog field.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replied in #6088 (comment)

Map<String, String> props = new HashMap<>();
props.put(READONLY, Boolean.TRUE.toString());

// TODO catalog. Need to double check the actual catalog here. Depend on HIVE-29278
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think we can replicate not NATIVE DBs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Some task can only do in NATIVE DBs. Like this ReploadTask.

This TODO is added to verify if the correct native catalog name is obtained. BTW, besides the default hive native catalog, we still need to support other multiple native catalogs.

As for external catalogs, we need to skip those tasks that are exclusive to native catalogs. However, the specific types of external catalogs (e.g., JDBC, REST) have not yet been defined, and this part of the work can be refined and completed subsequently. HIVE-29243 can be used to track this matter.

AlterDatabaseSetPropertiesDesc alterDbDesc = new AlterDatabaseSetPropertiesDesc(dbName, mapProp,
new ReplicationSpec(replState, replState));
// TODO catalog. Need to double check the actual catalog here. Depend on HIVE-29278
AlterDatabaseSetPropertiesDesc alterDbDesc = new AlterDatabaseSetPropertiesDesc(HiveUtils.getCurrentCatalogOrDefault(conf),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harshal-16 could you please validate if this is ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked the details in the main JIRA: https://issues.apache.org/jira/browse/HIVE-28879
As ACID tables can only reside in Hive (default), other catalogs don't make sense in case of replication here
Also, during dump operation itself, it filters only default catalog here:

new CatalogFilter(MetaStoreUtils.getDefaultCatalog(conf)),

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for checking, @harshal-16 ! what if we register multiple external HMS catalogs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, then I think we will need to update the incremental dump filter also to make it consistent with the load.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HIVE-29278 is used to track this matter. I think we can make the rep work with multiple native hms catalog.


HiveLock lck = lockMgr.lock(new HiveLockObject(dbObj.getName(), lockData), mode, true);
// Using the catalogName@databaseName format to uniquely identify a database.
HiveLock lck = lockMgr.lock(new HiveLockObject(catName + "@" +dbObj.getName(), lockData), mode, true);
Copy link
Member

@deniskuzZ deniskuzZ Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, nice hack! however, maybe we should use "." ?
I think it might actually work. i am not sure if show locks would handle this change, might need some tweaks:

show locks db1
show locks cat1.db1

note, might be problematic if CU has multiple HMS versions on a cluster

nit: please add space + dbObj.getName()

throw new HiveException("Database " + dbName + " does not exist ");
}
HiveLockObject obj = new HiveLockObject(dbObj.getName(), null);
HiveLockObject obj = new HiveLockObject(catName + "@" +dbObj.getName(), null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space

* @throws HiveException
* @throws NoSuchObjectException
*/
@Deprecated
Copy link
Member

@deniskuzZ deniskuzZ Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add to java doc "@deprecated since 4.1.0, will be removed in 5.0.0"


// TODO: this whole path won't work with catalogs
/**
* When you call this method, you need to ensure that the catalog has been set in the db object.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
}

public List<Table> getTableObjects(String catName, String dbName, String pattern, TableType tableType) throws HiveException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dengzhhu653 is it ok or we should use some request object like Database?

* @return list of table names that match the pattern.
* @throws HiveException
*/
public List<String> getTablesByType(String catName, String dbName, String pattern, TableType type)
Copy link
Member

@deniskuzZ deniskuzZ Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as https://github.com/apache/hive/pull/6088/files#r2454680968
can we reuse it in getTablesByType(String dbName, String pattern, TableType type) ?

}
}

public List<Function> getFunctionsInDb(String catName, String dbName, String pattern) throws HiveException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

public static String getCurrentCatalogOrDefault(Configuration conf) {
return SessionState.get() != null ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could try Optional.ofNullable

return this.tTable.getCatName();
}

public void setCatalogName(String catalogName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this setter on a Table object?

return database;
}

protected Database getDatabase(String catalogName, String dbName, boolean throwException) throws SemanticException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try reuse code. you could call new method within the old one without catalogName

Database db = metaData.getDatabase();
String destinationDBName =
context.dbName == null ? db.getName() : context.dbName;
String destinationCatalogName = db.getCatalogName(); // TODO catalog. Need to double check the catalog here. Depend on HIVE-29278
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

} else if (command.length > 1 && "show".equalsIgnoreCase(command[0]) &&
"processlist".equalsIgnoreCase(command[1])) {
return PROCESSLIST;
} else if(command.length > 1 && "set".equalsIgnoreCase(command[0]) && "catalog".equalsIgnoreCase(command[1])) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space

@@ -0,0 +1,9 @@
set hive.lock.numretries=0;
set hive.support.concurrency=true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is only needed for ACID tables when DBTxnManager is used

DESC CATALOG EXTENDED test_cat;

-- DROP catalog at the end
DROP CATALOG test_cat;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant space

@deniskuzZ
Copy link
Member

thanks @zhangbutao! in general LGTM, added some comments

@dengzhhu653
Copy link
Member

dengzhhu653 commented Nov 5, 2025

Sorry for the late reply.
Took a quick overview of the changes, looks we introduce a new syntax for the catalog, as well as the future items:

The code related to cat.db.tbl (part of the work in HIVE-29177) will also involve many table-related interfaces, which presents certain refactoring challenges.

let's further explore this issue when we work on cross-catalog queries. An already created ticket, HIVE-29242, can be used to track this matter.

It seems to me this is a significant/big change, and the user should follow a new way of catalog.db.table to request the table as the Trino/Presto does today.

I think we can have a more simple idea, let's push the catalog awareness down into the Metastore client. For example,
SET CATALOG testcat; -> change to a metaconf setting: set metaconf:metastore.catalog.default = testcat, default hive, this configuration will be propagated to HMS, every HMS API request will be appended with the testcat, both client and server can handle it.

For cross-catalog queries, we can also do it in a similar way, we can introduce a new catalog awareness lawyer between session client and the thrift client, as SessionHiveMetaStoreClient -> CatalogAwarenessMetastoreClient -> ThriftHiveMetaStoreClient. The CatalogAwarenessMetastoreClient can acknowledge of the catalog through configuration or properties stored in HMS(HIVE-27186). Something as hive.matastore.catalog.mapping.pattern=iceberg_catalog:ice_db1*,hive_db1,hive_db2.ice_tab*;hive:default,acid_db*, the CatalogAwarenessMetastoreClient even can point to a third-party catalog partner.

By this way, we don't need to introduce the change on the Hive SQL syntax and test it throughly, and other engines(such as Impala) can also benefit from this way to add the support on cross-catalog queries.

@dengzhhu653
Copy link
Member

dengzhhu653 commented Nov 5, 2025

The biggest flaw is we cannot have two same db name/table name among catalogs, this might be a question in the future.
For cross-catalog queries e.g, db1.name1 join db1.name1 in this case, CatalogAwarenessMetastoreClient mightn't be able to figure out whether the all db1.name1 tables belong to the same catalog or not.
I don't have an elegant solution yet, maybe we can add a limit that the db is unique across the catalogs, which as we do in the Hive catalog.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants